ABSTRACT

Summary: The current methods available to detect chromosomal 
abnormalities from DNA microarray expression data are cumbersome 
and inflexible. CAFE has been developed to alleviate these issues. It is 
implemented as an R package that analyzes Affymetrix *.CEL files and 
comes with flexible plotting functions, easing visualization of
chromosomal abnormalities.

Availability and implementation: CAFE is available from
https://bitbucket.org/cob87icW6z/cafe/ as both source and compiled packages
for Linux and Windows. It is released under the GPL version 3 license. 
CAFE will also be freely available from Bioconductor. 
Contact: sander.h.bollen@gmail.com or nancy.mah@mdc-berlin.de 
Supplementary information: Supplementary data are available at 
Bioinformatics online. 
1 INTRODUCTION

Gross chromosomal abnormalities are a hallmark of cancers 
(Hanahan and Weinberg, 2011) and are frequently acquired by 
cultured cells as an adaptation to cell culture conditions (Baker 
et al., 2007). Recently, it has been recognized that induced
pluripotent stem cells often feature gross chromosomal duplications
or deletions (Laurent et al., 2011).

Various methods exist to detect chromosomal gains or losses. 
Traditional karyotyping relies on careful examination of 
Giemsa-stained metaphase chromosomes. Newer techniques 
like spectral karyotyping have increased ease of analysis but 
nevertheless feature low resolution. For high-throughput and 
high resolution analysis of gross chromosomal abnormalities, 
array-based Comparative Genomic Hybridization (a-CGH) is 
often used. This a-CGH approach is based on the detection of 
a quantitative difference of DNA content. Whole-genome and 
SNP-based sequencing approaches have also been developed. 

Although not initially developed for this purpose, it is possible 
to use gene expression microarray data for the detection of copy 
number abnormalities. This approach is not based on
measurements of DNA content but rather on mRNA expression
levels. A protocol to use expression microarrays to 'karyotype'
samples was recently published by Benvenisty and coworkers but
requires the manual use of different tools (Ben-David et al.,
2013).

Here, we present CAFE (Chromosomal Aberration Finder in
Expression data), an R package for the detection of gross
chromosomal gains and losses from expression microarrays,
with a resolution up to cytoband level. CAFE follows the
expression-based karyotyping workflow (e-karyotyping) and greatly
simplifies and speeds up the detection of chromosomal
aberrations from expression DNA microarrays.

2 FEATURES AND METHODS 

The starting point of a CAFE analysis is a set of gene expression
microarrays from samples whose e-karyotype will be computed and another
(larger) set of microarrays representing controls. The controls define a
normal e-karyotype against which the altered e-karyotype will be defined.
We recommend choosing a dataset of at least 10 controls for 2-3 test
samples. More controls may be required depending on the particular
case. CAFE is implemented as an R package and relies on several
Bioconductor packages. It runs on R version 2.10 or newer. The
analyses are performed using Affymetrix *.CEL files as input. The
ProcessCels() function creates a list object from these *.CEL
files and returns it to the user. This object contains normalized and relative
expression levels, along with several mappings of probesets to
chromosomes, chromosomal arms, cytobands and chromosomal locations. The
output can be further filtered to exclude multiple probesets that map
to the same gene or to the same location. CAFE can then be used to
perform several enrichment tests for the detection of duplications or
deletions of chromosomes, chromosomal arms and cytobands. Furthermore,
several plot functions are available to visualize any detected aberrations.
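
To illustrate the relative expression levels mentioned above, the following Python sketch computes log2 ratios of a test sample against a per-probeset median of the controls. It is not CAFE code: the function name relative_log2_expression, the dictionary layout and the choice of the control median as the baseline are assumptions made for this example, and the input values are assumed to be normalized, positive and on a linear scale.

```python
from math import log2
from statistics import median

def relative_log2_expression(sample, controls):
    """Return the log2 ratio of each probeset in `sample` to its median in `controls`.

    `sample` maps probeset IDs to expression values (hypothetical layout);
    `controls` is a list of dictionaries with the same keys.
    """
    relative = {}
    for probeset, value in sample.items():
        # Baseline: median of this probeset across all control arrays (assumption).
        baseline = median(control[probeset] for control in controls)
        relative[probeset] = log2(value / baseline)
    return relative

# Toy data: probeset "p1" is doubled and "p2" is halved relative to the controls.
controls = [
    {"p1": 100.0, "p2": 100.0},
    {"p1": 90.0, "p2": 100.0},
    {"p1": 110.0, "p2": 100.0},
]
sample = {"p1": 200.0, "p2": 50.0}
relative = relative_log2_expression(sample, controls)
```

Working on the log2 scale makes gains and losses symmetric around zero, which is convenient for the threshold-based calls used later.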

2.1 Enrichment testing 

CAFE contains three statistical functions that determine enrichment or 
depletion of a given chromosome/chromosomal region. One function 
exists for each of the three resolutions: chromosomeStats(),
armStats() and bandStats(), corresponding to chromosomes,
arms and cytobands, respectively. The ability of CAFE to detect
aberrations within a chromosome, arm or cytoband depends on the density of
the microarray probesets within these areas. Aberrations in gene-poor areas are
not likely to be detected, as expression microarrays are designed to detect
transcribed genes. The user defines two thresholds as a ratio of median
expression values, 'over' and 'under', for which probesets are called over-
and under-expressed. The threshold for genomic DNA hybridized onto a
comparative genomic hybridization array would be ±log2(2) ratio to
detect a chromosome gain or loss, respectively. Because CAFE uses
mRNA expression as a surrogate for DNA copy number and because
the levels of mRNA expression are variable and do not strictly reflect
DNA copy number, a less restrictive default threshold is recommended as
a starting point for analysis: ±log2(1.5) ratio. Using these thresholds,
enrichment of under- and over-expressed probes in chromosomes or
chromosomal regions is computed using a Fisher's exact test or a
χ² test. P-values are Bonferroni-corrected by default.

Fig. 1. Samples GSM652238 (orange) and GSM652239 (blue) from GEO
dataset GSE26526 were plotted. GSM652238 is known to have a deletion
in Chromosome 11q. (A) The output of slidPlot(). (B) The output of
discontPlot().
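
The enrichment test of Section 2.1 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the CAFE implementation: the function names, the 2x2 table layout (on or off the chromosome versus called or not called over-expressed) and the Bonferroni correction over the number of chromosomes tested are choices made for this example. Only over-expression is shown; under-expression would use the mirrored threshold.

```python
from math import comb, log2

def fisher_greater(a, b, c, d):
    """One-sided Fisher's exact test, P(X >= a), for the table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d
    total = comb(n, col1)
    # Sum the hypergeometric probabilities of tables at least as extreme as observed.
    tail = sum(
        comb(row1, k) * comb(n - row1, col1 - k)
        for k in range(a, min(row1, col1) + 1)
    )
    return tail / total

def over_expression_enrichment(log_ratios, chromosomes, over=log2(1.5)):
    """Bonferroni-corrected P-value per chromosome for over-expressed probesets."""
    called = [value > over for value in log_ratios]
    names = sorted(set(chromosomes))
    corrected = {}
    for name in names:
        pairs = list(zip(chromosomes, called))
        a = sum(1 for chrom, hit in pairs if chrom == name and hit)
        b = sum(1 for chrom, hit in pairs if chrom == name and not hit)
        c = sum(1 for chrom, hit in pairs if chrom != name and hit)
        d = sum(1 for chrom, hit in pairs if chrom != name and not hit)
        # Bonferroni: multiply by the number of chromosomes tested, cap at 1.
        corrected[name] = min(1.0, fisher_greater(a, b, c, d) * len(names))
    return corrected

# Toy data: three probesets on chromosome "1" exceed log2(1.5), none on "2".
log_ratios = [0.9, 0.8, 0.7, 0.0, 0.1, -0.1]
chromosomes = ["1", "1", "1", "2", "2", "2"]
p_values = over_expression_enrichment(log_ratios, chromosomes)
```

With so few probesets the corrected P-value for chromosome "1" cannot become small, which mirrors the point made above that detection depends on probeset density.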

2.2 Graphics 

For graphics, CAFE uses the ggplot2 plotting system (Wickham, 2009). 
There are four different plotting functions available: (i) rawPlot(): the
'raw' log-transformed relative expression values are plotted along the
chromosome of interest; (ii) slidPlot(): a moving average smoother
is applied to the log-transformed relative expression values before plotting
the values along the chromosome of interest; (iii) discontPlot(): a
discontinuous smoother is applied to the log-transformed relative
expression values and the values are plotted along the chromosome of interest;
(iv) facetPlot(): all chromosomes are plotted in one horizontally
aligned graph, with relative expression values along each chromosome. 
This function can be used in conjunction with a moving average smoother. 
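
The idea of a moving average smoother can be illustrated with a short Python sketch. It is an assumption-laden example rather than the smoother used by slidPlot(): the function name, the centered window, the truncation of windows at the chromosome ends and the window size k are all choices made here for illustration.

```python
def moving_average(positions, values, k=3):
    """Smooth `values` with a centered window of k probesets, ordered by position.

    Windows are truncated at both ends, so the output has the same length
    as the input. Returns (sorted_positions, smoothed_values).
    """
    # Order the probesets along the chromosome before smoothing.
    ordered = sorted(zip(positions, values))
    xs = [position for position, _ in ordered]
    ys = [value for _, value in ordered]
    half = k // 2
    smoothed = []
    for i in range(len(ys)):
        window = ys[max(0, i - half): i + half + 1]
        smoothed.append(sum(window) / len(window))
    return xs, smoothed

# Toy data: a single outlying probeset is spread over its neighbours.
xs, smoothed = moving_average([50, 10, 30, 20, 40], [0.0, 0.0, 3.0, 0.0, 0.0], k=3)
```

A larger k suppresses more probeset-level noise at the cost of blurring the boundaries of an aberrant region.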

For all plot functions except facetPlot(), it is possible to add a
chromosome idiogram over the chromosome plot. This allows easy
visualization of chromosomal abnormalities. See Figure 1 for example plots.
All plots are printed to the file system and returned as ggplot2 objects. 
Plot parameters, such as labels and scales, can be modified to the user's 
liking by altering the ggplot2 object. 

2.3 Comparison with other tools 

To the best of our knowledge, there are currently no other Bioconductor 
packages that are designed to identify chromosomal copy number 
abnormalities from mRNA expression microarray data. However, there 
are other packages that perform similar functions, such as processing 
comparative genomic hybridization arrays (aCGH, snapCGH) or
identifying differentially regulated regions on a chromosome from expression
microarrays (MACAT). The initial choice of datasets for analysis is
critical and must be completed by hand, regardless of the package used. Once
this is done, CAFE is able to normalize and preprocess the *.CEL files in
one step. Although aCGH and snapCGH were designed to analyze CGH
arrays, one can use CAFE to preprocess the *.CEL files and then
reformat the preprocessed data for input into these packages.
Both aCGH and snapCGH use hidden Markov models to predict state
changes (i.e. changes in chromosomal copy number). Plotting functions
then show the course of state changes over the chromosome. MACAT
uses a modified t-statistic and permutation to score regions of the
chromosome that are differentially regulated. The scores for a selected
chromosome are shown as a static html page. To compare
performance, CAFE and the other three packages were used to analyze two test
datasets with known chromosomal aberrations (see Supplementary 
Data). CAFE was able to detect copy number abnormalities just as 
well as or better than the other packages. 



3 DISCUSSION 

Karyotyping by expression microarrays, as described by
Ben-David et al. (2013), extends the utility of expression microarray
data by providing some limited information on the status of 
chromosomal aberrations in a sample. However, the original 
e-karyotyping method is a tedious process, using four different 
programs and requiring an estimated 15 h to analyze only 15 
samples. At the time of writing, there are no Bioconductor packages
to specifically carry out e-karyotyping from raw microarray
data. The CAFE package simplifies the e-karyotyping protocol. 
Starting from the *.CEL files, CAFE can do the same analysis in 
minutes and requires no more than basic R knowledge. 
Bioconductor packages for CGH analysis can be used to perform 
an e-karyotyping analysis, but the data preprocessing steps 
must be carried out manually and the resulting graphs are 
static and not user-configurable. In contrast, CAFE processes 
the expression data from *.CEL files and all plotting functions 
return a ggplot2 object that can be modified by the end-user to 
his or her specific needs. CAFE will become a Bioconductor 
package, to be freely available for anyone to use, distribute, 
modify and easily integrate into an R workflow for automatic
analysis. 

Currently, only Affymetrix *.CEL files from 3' IVT arrays can
be seamlessly preprocessed in CAFE; functions for processing 
raw microarray data from other platforms may be added in 
the future if there is demand for this feature. Data import 
from other platforms is still possible, as CAFE data are 
represented by a simple R list structure. CAFE analysis currently 
works with three different resolutions: whole chromosome, 
chromosomal arm and cytoband. Therefore, it is not suited 
for smaller deletions or duplications. In addition, it is not 
readily possible to detect insertions or translocations using 
this technique, as the probesets will be mapped to the original
chromosome or location. Therefore, CAFE is most suited
for the detection of numerical abnormalities, rather than 
structural abnormalities. CAFE has not been designed to 
replace existing karyotyping techniques but rather to gain information
when data from more specific approaches are not yet
available.